In this unit, we’re going to learn how to make graphs. The urge to present data in a pictorial format is an ancient one, and you are sure to find a primordial satisfaction in learning how to do so effectively.
What are the benefits of displaying data visually?
Let’s take a moment to review what we know about variables. This will prove beneficial later.
There are many ways to present data visually, including but not limited to:
A histogram has:
It is used to:
A density plot has:
It is used to:
Histograms and density plots are highly compatible and convey similar information.
They can be overlaid to make a nice graph.
A scatterplot has:
It is used to:
A scatterplot also often includes:
This line might be:
A line graph has:
It is used to:
A bar graph has:
It is used to:
A bar graph often also has:
Error bars are used to:
A box plot has:
These axes can be switched.
What does a box-plot show?
A box plot is used to:
A violin plot is a variation on the box plot.
Instead of using boxes and lines, it simply shows a sideways density plot for each group.
A violin plot and box plot can be combined for an extra informative (and classy) graph.
A bubble plot takes a variety of forms.
It is usually either:
The key feature of a bubble plot is that each point is scaled to reflect the value of some third variable.
It is used to:
A heat map is another way to represent a 3rd variable.
The difference is that the bubble plot uses size to represent some third variable’s relationship to two variables, while a heat map uses color. The color usually scales from ‘cool’ colors (blues) to ‘hot’ colors (reds) - hence the name heat map.
Like a bubble plot, a heat map is usually either:
Some other examples can be found here: (https://r-graph-gallery.com/heatmap.html)
Identify each of the graphs below. What kind is it?
| One continuous | Two continuous | One categorical, one continuous | Three, at least one continuous |
|---|---|---|---|
| ? | | | |
| ? | | | |
| ? | | | |
| Participant | Condition | Observation |
|---|---|---|
| 1 | A | 5 |
| 2 | A | 2 |
| 3 | A | 7 |
| 4 | A | 4 |
| 5 | A | 1 |
| 6 | B | 10 |
| 7 | B | 4 |
| 8 | B | 10 |
| 9 | B | 5 |
| 10 | B | 9 |
| Musher | Checkpoint | Time |
|---|---|---|
| Jerry Sousa | 1 | 243 |
| Jerry Sousa | 2 | 176 |
| Jerry Sousa | 3 | 304 |
| Jerry Sousa | 4 | 201 |
| Melissa Owens | 1 | 215 |
| Melissa Owens | 2 | 421 |
| Melissa Owens | 3 | 334 |
| Melissa Owens | 4 | 220 |
For the next three questions, we will look at the mtcars data set, which comes pre-loaded in R. To learn more about this data set, check the help page (?mtcars).
select(mtcars, c(1:6)) %>% # We'll just look at the first 6 columns
knitr::kable()
| | mpg | cyl | disp | hp | drat | wt |
|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 |
Every time you create a plot with ggplot, follow these 7 steps:
ggplot(data = diamonds)
Don’t bother to run this yet; you’ll only get a blank canvas.
PRO TIP: In R Markdown, create a new code chunk for each plot you make and keep other code out of these chunks as much as possible. This way, you can easily move plots around within your document and customize them as desired.
ggplot(data = diamonds, aes(x = carat, y = price))
Now we’ve specified what the x and y axis should be, and ggplot has laid it out for us. Next, we need to put something on the graph.
Now that we have the ‘foundation’ of our plot, we can add layers, using +
The type of layer(s) you add will determine what kind of plot you make.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point()
Notice how I used the + to add a new piece to this command. The ggplot package is unique in stringing together functions in this way.
The geom_point() command adds a layer of points to our plot. There are lots of geoms you can add. We’ll see a few of the most useful ones later.
The real fun of ggplot is the ability to customize your plots - to make them look exactly how you want. Struggling to decide between a career as a data scientist or as an artist? Making plots with ggplot lets you do both!
Different geoms can take different arguments, called ‘aesthetics’, that can be customized. Some common ones are size, color, shape, and fill.
The help page for each geom function tells you what aesthetics are available for it.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(size = 2,
color = "blue",
shape = 1,
stroke = 1)
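If you’d rather not dig through the help page, here is a minimal sketch of another way to list a geom’s aesthetics. It assumes the ggplot2 package is loaded and reads two fields of the GeomPoint object that geom_point() uses behind the scenes. Note that ggplot2 spells it "colour" internally, although color works just as well when you type it.

```r
library(ggplot2)

# The help page route: the "Aesthetics" section lists what the geom understands.
# ?geom_point

# The same information can be read from the geom object itself.
required <- GeomPoint$required_aes        # aesthetics you must supply
optional <- names(GeomPoint$default_aes)  # aesthetics that have a default value

print(required)
print(optional)
```

Swap GeomPoint for GeomBar, GeomLine, and so on to check other geoms.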
In ggplot, we can specify aesthetics in 2 different ways:
This is what we do when we want some aesthetic property of our graph to be constant. The previous example used this method, making all of the points blue circles that are size 2.
This is what we do when we want some property of the graph to vary based on some variable in our data. For example, if we want the color of the points to vary depending on the clarity of the diamonds, and the shape of the points to vary based on the cut of the diamonds, we put these arguments inside aes():
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(size = 2,
aes(color = clarity, shape = cut))
Notice that the dots come in different colors and shapes now. Also, there are now legends that help us interpret these colors and shapes. Finally, notice that size = 2 is NOT inside aes(), because we want this to be constant, applying to ALL points in the same way.
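The difference between the two ways is easy to mix up, so here is a small sketch that puts them side by side, plus one common slip. It uses layer_data(), a ggplot2 helper that returns the values a layer is actually drawn with; the object names are just ones I made up for this illustration.

```r
library(ggplot2)

# Constant: every point is blue (the argument sits outside aes()).
p_constant <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue")

# Mapped: the color varies with the clarity column (the argument sits inside aes()).
p_mapped <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity))

# Common slip: a constant placed inside aes() is treated as data, so ggplot
# invents a one-level variable called "blue" and picks its own color for it.
p_slip <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = "blue"))

# Look at the colors each version ends up using.
constant_colours <- unique(layer_data(p_constant)$colour)
mapped_colours   <- unique(layer_data(p_mapped)$colour)
slip_colours     <- unique(layer_data(p_slip)$colour)

constant_colours        # just "blue"
length(mapped_colours)  # one color per clarity level
slip_colours            # a single color, but not the one you asked for
```

If your points come out in an unexpected color with a legend that says "blue", this slip is the likely cause.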
Consider step 2 again:
ggplot(data = diamonds, aes(x = carat, y = price))
Why are the x and y axes defined inside an aes()?
We don’t have to stop at just 1 layer. Let’s add a trendline.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(size = 2,
aes(color = clarity, shape = cut)) +
geom_smooth(color = "black", fill = "white")
Each new layer is added on top of the previous layer - notice how the line covers the points. If we switch the order, we can move layers up or down.
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_smooth(color = "black", fill = "white") +
geom_point(size = 2,
aes(color = clarity, shape = cut))
But it’s hard to see the line this way, so I think I liked the first order better.
Now that the plot looks the way we want, we can add some further customization.
Let’s start with 3 things:
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(size = 2,
aes(color = clarity, shape = cut)) +
geom_smooth(color = "black", fill = "white") +
theme_bw() +
labs(
title = "Diamond Plot",
subtitle = "in case you are interested",
caption =
"Fig. 1. Some relevant data about a girl\'s best friend",
x = "How Many Carats?",
y = "Price (in $)",
tag = NULL,
#Useful if your figure is part of a larger multi-panel figure
alt = "Oops, the plot is missing"
# Alt text for websites when the plot doesn't load
)
Now let’s briefly discuss theme(). You can use this to further customize your graph in almost any way you want, including:
Here’s a simple example, mostly focused on customizing the legend:
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(size = 2,
aes(color = clarity, shape = cut)) +
geom_smooth(color = "black", fill = "white") +
theme_bw() +
labs(
title = "Diamond Plot",
subtitle = "in case you are interested",
caption =
"Fig. 1. Some relevant data about a girl\'s best friend",
x = "How Many Carats?",
y = "Price (in $)"
) + theme(text = element_text(size = 20),
legend.position = c(0.9, 0.4),
legend.text = element_text(size = 5),
legend.title = element_text(size = 8),
legend.key.height = unit(0.1, "in"))
If you’re using R Markdown (and you should be!), then you can skip this step - your plots will show up in your document automatically (after you run your code).
But if we wanted to save our beautiful graph to an image file, we would use ggsave().
We can specify the file type (by giving the file name an appropriate extension, like .png) width, height, and dots per inch.
ggsave("MyBeautifulPlot.png", width = 6, height = 4, units = "in", dpi = 300)
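By default, ggsave() saves the last plot you displayed. Here is a hedged sketch of a slightly safer habit: store the plot in a named object and hand it to ggsave() through the plot argument. The object name carat_plot and the use of tempfile() (so nothing gets written into your project folder) are my choices for this illustration, and I save a .pdf here only to show that the extension picks the file type.

```r
library(ggplot2)

# Build a plot and keep it in a named object.
carat_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue", shape = 1) +
  labs(x = "How Many Carats?", y = "Price (in $)") +
  theme_bw()

# Save that specific plot. The .pdf extension tells ggsave() to write a PDF.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = carat_plot, width = 6, height = 4, units = "in")

file.exists(out_file)
```

In your own work you would replace tempfile() with a real file name, like "MyBeautifulPlot.png".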
Before we make a new graph from scratch, let’s practice adjusting the aesthetics on an existing one.
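# Build the data frame directly so that Checkpoint and Time stay numeric
# (cbind() on mixed text and numbers would turn every column into text)
iditaroddata <- data.frame(
  Musher = rep(c("Jerry Sousa", "Melissa Owens"), each = 4),
  Checkpoint = rep(1:4, times = 2),
  Time = c(243, 176, 304, 201, 215, 421, 334, 220)
)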
ggplot(data = iditaroddata, aes(x = Checkpoint, y = Time, group = Musher)) +
geom_line() +
geom_point(size = 5)
When making your graphs, keep the following in mind:
I’ll be using data from the nycflights13 package for these demos. So, let’s load it.
if (!require("nycflights13")) install.packages("nycflights13")
library(nycflights13)
This package includes the flights data set, which records all 336,776 flights that departed from NYC airports in 2013. For each flight, it includes the date, time, scheduled time, carrier, origin, destination, air time, and distance traveled.
You know how to explore a data set by now! Take a moment to look at the flights data before moving on.
Let’s start by making a histogram. We’ll look at every air traveler’s worst enemy: departure delays.
First we’ll set up ggplot, then add a histogram layer with geom_histogram().
In the code below, notice that no Y axis is specified. For histograms (and density plots), only the X axis is needed. The Y axis is computed from the data.
ggplot(data = flights, aes(x = dep_delay)) + geom_histogram()
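If you are curious what "computed from the data" means, here is a small sketch. It uses the built-in mtcars data instead of flights so that it runs quickly, and layer_data() from ggplot2 to peek at the numbers behind the bars. The binwidth of 5 is an arbitrary choice for the illustration.

```r
library(ggplot2)

# mtcars ships with R; mpg is miles per gallon for 32 cars.
p <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# layer_data() shows what ggplot computed: one row per bar.
bars <- layer_data(p)
bars[, c("xmin", "xmax", "count")]

# The bar heights are counts, so they add up to the number of rows in the data.
sum(bars$count) == nrow(mtcars)
```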
Let’s make our graph look better. We’ll make 3 changes:
ggplot(data = flights, aes(x = dep_delay)) +
geom_histogram(binwidth = 2, color = "darkblue", fill = "lightblue")
We don’t need any more layers, so we’ll skip step 5: add further layers.
We need to further finalize the plot. There are 3 changes I want to make:
ggplot(data = flights, aes(x = dep_delay)) +
geom_histogram(binwidth = 2, color = "darkblue", fill = "lightblue") +
coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
labs(
x = "Departure Delay (minutes)",
y = "How Many Flights?",
title = expression(paste(bold("Figure 1."), " Histogram of Flight Departure Delays"))
) + theme_bw()
Next, let’s make a density plot. We’ll start with our histogram, but change the geom_histogram to geom_density:
ggplot(data = flights, aes(x = dep_delay)) +
geom_density() +
coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
labs(
x = "Departure Delay (minutes)",
y = "Probability",
title = expression(paste(bold("Figure 2."), " Density Plot of Flight Departure Delays"))
) + theme_bw()
That’s OK, but we can do better. Let’s customize the density plot layer by:
ggplot(data = flights, aes(x = dep_delay)) +
geom_density(color = "darkblue", linetype = 2, fill = "lightblue") +
coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
labs(
x = "Departure Delay (minutes)",
y = "Probability",
title = expression(paste(bold("Figure 2."), " Density Plot of Flight Departure Delays"))
) + theme_bw()
I love putting the density plots and histograms together. They’re highly compatible.
BUT, this takes a bit of work because the Y axes are different for these two plot types and we need to make them match.
Notice:
ggplot(data = flights, aes(x = dep_delay)) +
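geom_density(color = "black", linetype = 1, linewidth = 1) + # linewidth replaces size for lines in ggplot2 3.4.0 and later
geom_histogram(aes(y = after_stat(density)), binwidth = 2, color = "darkblue", fill = "lightblue", alpha = 0.75) + # after_stat(density) puts the histogram on the same scale as the density plot; it replaces the older ..density.. notation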
coord_cartesian(xlim = c(-50, 200)) + # This zooms in the graph
labs(
x = "Departure Delay (minutes)",
y = "Probability",
title = expression(paste(bold("Figure 3."), " Density Plot (and Histogram) of Flight Departure Delays"))
) + theme_bw()
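Why does putting the histogram on the density scale make the two plots match? Here is a small sketch of the arithmetic, again using the built-in mtcars data to keep things quick. It uses after_stat(density), which is the current spelling of ..density.. in recent versions of ggplot2. The binwidth of 5 is just a choice for the illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5)

bars <- layer_data(p)

# On the density scale, bar height times bar width is the share of cars in that bin,
# so the areas of all the bars add up to 1, just like the area under a density curve.
bar_area <- bars$density * (bars$xmax - bars$xmin)
sum(bar_area)
```

Because both layers now describe a total area of 1, they sit comfortably on the same Y axis.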
It’s scatterplot time!
In this graph, we’ll show the relationship between distance traveled and time in the air. This one takes longer to run, because there are many points to plot.
Note: google “ggplot shapes” to find the number codes for different shapes. Or go here: (http://sape.inf.usi.ch/quick-reference/ggplot2/shape) Only shapes 21-25 can take a fill aesthetic that is different from their color.
ggplot(data = flights, aes(y = distance, x = air_time)) +
geom_point(color = "black", fill = "gray", shape = 21, size = 3) +
labs(
y = "Flight Distance (miles)",
x = "Flight Time (minutes)",
title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
) + theme_bw()
But now, let’s add a trend line using geom_smooth.
We’ll add 3 arguments:
ggplot(data = flights, aes(y = distance, x = air_time)) +
geom_point(color = "black", fill = "gray", shape = 21, size = 3) +
geom_smooth(method = lm, formula = y ~ x) +
labs(
y= "Flight Distance (miles)",
x = "Flight Time (minutes)",
title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
) + theme_bw()
That’s a strong linear trend!
Let’s make the points and lines different for the different airports.
ggplot(data = flights, aes(y = distance, x = air_time)) +
geom_point(aes(color = origin, fill = origin, shape = origin), size = 3) +
geom_smooth(method = lm, formula = y ~ x, aes(color = origin)) +
labs(
y= "Flight Distance (miles)",
x = "Flight Time (minutes)",
title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
) + theme_bw()
Wouldn’t it be nice to specify what the colors, shapes, etc. should be? We can do that!
We can use a pre-built color palette, like these: (http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#palettes-color-brewer)
Or we can define our own, using these colors: (http://sape.inf.usi.ch/quick-reference/ggplot2/colour)
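As a quick illustration of the pre-built route, here is a minimal sketch using scale_color_brewer(), which applies a ColorBrewer palette by name. I use the built-in mtcars data and the "Dark2" palette here; both are just choices for the example, not the ones used in the flights graphs.

```r
library(ggplot2)

# cyl is stored as a number, so factor() turns it into three categories.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Dark2") +  # a pre-built ColorBrewer palette
  labs(color = "Cylinders")

p
```

There is a matching scale_fill_brewer() for fill colors.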
Instead of typing the whole graph code over and over, we’ll give the graph a name and then add to it, like so:
flightscatter <- ggplot(data = flights, aes(y = distance, x = air_time)) +
geom_point(aes(color = origin, fill = origin, shape = origin), size = 3) +
geom_smooth(method = lm, formula = y ~ x, aes(color = origin)) +
labs(
y= "Flight Distance (miles)",
x = "Flight Time (minutes)",
title = expression(paste(bold("Figure 7. "), "Scatter Plot of Flight Distance by Flight Duration"))
) + theme_bw()
flightscatter +
scale_color_manual(values = c("coral4", "dodgerblue4", "grey22")) +
scale_fill_manual(values = c("coral1", "dodgerblue1", "grey")) +
scale_shape_manual(values = c(21, 22, 24))
This allows you to mess around with colors without re-doing the whole graph each time.
Let’s also rename and move the legend. How did I do those two things?
flightscatter <- flightscatter +
scale_color_manual(values = c("coral4", "dodgerblue4", "grey22")) +
scale_fill_manual(values = c("coral1", "dodgerblue1", "grey")) +
scale_shape_manual(values = c(21, 22, 24)) +
labs(color = "Airport", fill = "Airport", shape = "Airport") +
theme(
legend.position = c(0.10, 0.80)
# These numbers represent a proportion of the chart area,
# first the x coord, then the y
)
flightscatter
That looks pretty good. Now it’s your turn!
Let’s discuss faceting for a moment. To facet a graph is to create multiple separate plot areas. The code below divides the plot into three plots, one for each of the 3 airports. They are distributed horizontally. You could switch the . and the origin to arrange the facets vertically instead. And you could add another variable in place of the . to make a grid of graphs. See also facet_wrap().
flightscatter + facet_grid(. ~ origin)
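Here is a sketch of the variations just described, using the built-in mtcars data so that it runs quickly. The variables cyl (number of cylinders) and am (transmission type) stand in for origin and whatever second variable you might choose; the object names are made up for this illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

side_by_side <- base_plot + facet_grid(. ~ cyl)   # one column per cyl value
stacked      <- base_plot + facet_grid(cyl ~ .)   # one row per cyl value
two_way      <- base_plot + facet_grid(am ~ cyl)  # rows = am, columns = cyl
wrapped      <- base_plot + facet_wrap(~ cyl)     # panels wrap onto new rows as needed

# Count the panels in the two-way grid: 2 values of am times 3 values of cyl.
n_panels <- length(unique(layer_data(two_way)$PANEL))
n_panels
```

facet_wrap() is handy when one variable has many values and a single row or column would get too cramped.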
Let’s make a bar graph. This takes a bit more setup than the others, because we usually want to plot the means, not the raw data. And if we want to add standard error bars (we do!), we’ll need to compute the standard errors as well.
library(plotrix) # for the std.error function
flights_means <- flights %>%
group_by(origin) %>%
summarise(duration = mean(air_time, na.rm = TRUE), se = std.error(air_time))
Now to the actual plot. Note the stat = “identity”. By default, geom_bar plots the count of the data points (i.e. how many). So, without stat = “identity”, the bar chart would show us how many flights there were from each airport. But we don’t want that right now. Instead, we want to show the mean, which is the value in the data, the identity of the number in the cell. So we say so. Get it?
We also want the error bars to extend from the means (duration) up and down one standard error. Hence aes(ymin=duration-se, ymax=duration+se).
ggplot(data = flights_means, aes(x = origin, y = duration)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin=duration-se, ymax=duration+se),
width=.2, # Width of the error bars
position=position_dodge(.9))
I’ll worry about making this one pretty later.
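Two side notes on the bar graph, sketched with the built-in mtcars data. First, geom_col() is a shortcut that does the same job as geom_bar(stat = "identity"). Second, if you would rather not load plotrix, the standard error can be computed by hand with the usual formula (standard deviation divided by the square root of the number of observations). The column names mean_mpg and se are ones I picked for this example.

```r
library(ggplot2)
library(dplyr)

# Mean and standard error of mpg for each number of cylinders.
# Standard error = standard deviation / square root of the number of non-missing values.
mpg_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg, na.rm = TRUE),
    se = sd(mpg, na.rm = TRUE) / sqrt(sum(!is.na(mpg)))
  )

# geom_col() draws bars at the values in the data, like geom_bar(stat = "identity").
p <- ggplot(data = mpg_means, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_mpg - se, ymax = mean_mpg + se), width = 0.2)

p
```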
Let’s move on to box plots. The first difference is that we need to define both an X and Y axis. We’ll plot flight distance (Y axis) by origin (i.e. what airport the flight left from; X axis).
ggplot(data = flights, aes(y = distance, x = origin)) +
geom_boxplot(color = "darkblue", linetype = 2, fill = "lightblue") +
labs(
y = "Flight Distance (miles)",
x = "Airport",
title = expression(paste(bold("Figure 4. "), "Box Plot of Flight Distance by Airport"))
) + theme_bw()
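To connect the picture back to the numbers, here is a small sketch that pulls out what each box is drawn from. It uses the built-in mtcars data (mpg by number of cylinders) rather than flights, and layer_data() from ggplot2 to read the computed values.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# One row per box: whisker ends (ymin, ymax), box edges (lower, upper),
# and the line in the middle of the box (middle).
boxes <- layer_data(p)
boxes[, c("ymin", "lower", "middle", "upper", "ymax")]

# The middle line is the median of each group.
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
medians
```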
geom_violin works just like geom_boxplot, so to make a violin plot we’ll just change the geom name:
ggplot(data = flights, aes(y = distance, x = origin)) +
geom_violin(color = "darkblue", linetype = 2, fill = "lightblue") +
labs(
y = "Flight Distance (miles)",
x = "Airport",
title = expression(paste(bold("Figure 5. "), "Violin Plot of Flight Distance by Airport"))
) + theme_bw()
To combine violin and box plots, we have to adjust their widths.
ggplot(data = flights, aes(y = distance, x = origin)) +
geom_violin(color = "black", linetype = 1, fill = "gray", width = 1.4) +
geom_boxplot(color = "darkblue", linetype = 2, fill = "lightblue", width = 0.02) +
labs(
y = "Flight Distance (miles)",
x = "Airport",
title = expression(paste(bold("Figure 5. "), "Violin Plot of Flight Distance by Airport with Box Plot"))
) + theme_bw()
Now let’s try some more advanced graph-making, including facets and themes.
TitanicData <- data.frame(Titanic)
https://upload.wikimedia.org/wikipedia/commons/4/4f/Titanic_the_sinking.jpg
Like so:
Hint: the packages you’ll need include jpeg and ggimage.
Here are some other graph types you may need to make someday.
Let’s make a line graph. We’ll plot how departure delays change across the year.
We must do a bit of data preparation for this, since we want to plot the means, not the raw data.
library(plotrix) # for the std.error function
month_means <- flights %>%
group_by(month, origin) %>%
summarise(departure_delay = mean(dep_delay, na.rm = TRUE), se = std.error(dep_delay, na.rm = TRUE))
And now on to the plot. We’ll put some points in (with geom_point()), and connect those points with geom_line().
Other things I did (can you figure out which command does these things?):
ggplot(data = month_means, aes(y = departure_delay, x = month)) +
geom_line(aes(color = origin, linetype = origin)) +
geom_point(aes(color = origin, fill = origin, shape = origin), size = 3, alpha = 0.75) +
geom_errorbar(aes(ymin=departure_delay-se, ymax=departure_delay+se),
width=.2, # Width of the error bars
position=position_dodge(.9)) +
labs(
y = "Departure Delay (minutes)",
x = "Month",
color = "Airport",
linetype = "Airport",
fill = "Airport",
shape = "Airport",
title = expression(paste(italic("Figure 7. "), "Line Plot of Flight Delay by Month"))
) + theme_bw() +
scale_color_manual(values = c("coral4", "dodgerblue4", "grey22")) +
scale_fill_manual(values = c("coral1", "dodgerblue1", "grey0")) +
scale_shape_manual(values = c(21, 22, 24)) +
scale_x_continuous(labels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"),
breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
) + theme(axis.text.x = element_text(vjust = 0.25, hjust = 1, angle = 90),
legend.position = c(0.78, 0.8))
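A small shortcut for the month labels: R has a built-in constant called month.name that holds the twelve month names (and month.abb for the three-letter versions), so you don’t have to type them out. Here is a minimal sketch; the data frame toy and its values are made up purely so there is something to plot.

```r
library(ggplot2)

# Hypothetical monthly values, invented only for this illustration.
toy <- data.frame(month = 1:12, value = c(3, 4, 6, 5, 7, 9, 10, 8, 6, 5, 4, 6))

p <- ggplot(data = toy, aes(x = month, y = value)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12, labels = month.name) +  # month.abb gives "Jan", "Feb", ...
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

p
```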
A bubble plot is really just a scatterplot, except that the size of the points varies by some 3rd variable.
library(plotrix) # for the std.error function
flights_means <- filter(flights, carrier %in% c("AA", "B6", "DL", "EV", "MQ", "UA", "US")) %>%
group_by(origin, carrier) %>%
summarise(duration = mean(air_time, na.rm = TRUE), departure_delay = mean(dep_delay, na.rm = TRUE), se = std.error(air_time, na.rm = TRUE))
ggplot(data = flights_means, aes(y =carrier, x = origin)) +
geom_point(aes(size = departure_delay), color = "blue", shape = 21, fill = "lightblue") +
labs(
y= "Airline",
x = "Airport",
size = "Departure Delay", # the bubbles are scaled with the size aesthetic, so this titles their legend
title = expression(paste(italic("Figure 9. "), "Bubble Plot of Departure Delay"))
) + theme_bw()
To get a grid-like heatmap, use geom_tile() or geom_raster().
ggplot(data = flights_means, aes(y =carrier, x = origin)) +
geom_raster(aes(fill = departure_delay)) +
scale_fill_gradient(low="black", high="red") +
labs(
y= "Airline",
x = "Airport",
fill = "Departure Delay",
title = expression(paste(italic("Figure 10. "), "Heat Map of Departure Delay"))
) + theme_bw()
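The example above uses geom_raster(). For comparison, here is a minimal sketch with geom_tile(), which lets you draw a border around each cell. The little data frame toy and its values are made up for this illustration, and I use a blue-to-red gradient to match the ‘cool’ to ‘hot’ idea described earlier.

```r
library(ggplot2)

# A small made-up grid: every combination of two categorical variables gets one value.
toy <- expand.grid(row = c("A", "B", "C"), column = c("X", "Y"))
toy$value <- c(1, 5, 3, 8, 2, 6)

p <- ggplot(data = toy, aes(x = column, y = row)) +
  geom_tile(aes(fill = value), color = "white") +  # color draws a border around each tile
  scale_fill_gradient(low = "blue", high = "red")  # cool colors for low values, hot for high

p
```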
Now you’re ready to make some graphs of your own!